A Two-Phase Spectral Bigraph Co-clustering Approach for the “Who Rated What” Task in KDD Cup 2007
Abstract
This paper describes our approach to the “Who Rated What” task of the KDD Cup 2007 competition. Given the Netflix data set, which consists of more than 100 million ratings made between 1998 and 2005, the task is to predict the probability that each user-movie pair in a test set was rated in 2006. The test set contains 100,000 user-movie pairs drawn from the Netflix data set. We model the Netflix data set as a bipartite graph (or bigraph) with users on one side and movies on the other. The bigraph contains only directed edges from user nodes to movie nodes, and each directed edge corresponds to a rating event, i.e., the user rated the movie at some time. The task can then be formulated as a link existence prediction problem: does a directed link exist between a given user node and a given movie node? Because the data set is huge and its ratings are sparse, it is important to reveal the hidden class-based correlation between users and movies in the bigraph while keeping the computational complexity relatively low. To this end, we use a two-phase spectral bigraph co-clustering approach. The key idea is to obtain user and movie neighborhoods simultaneously via co-clustering and then to generate predictions from the co-clustering results. The approach has three steps. First, users and movies are each coarsely clustered using the K-means algorithm. Second, the resulting user and movie clusters are co-clustered using a multipartite spectral graph partitioning algorithm. Third, a probabilistic model derived from the co-clustering results predicts the probability that a link exists between a user node and a movie node. Experimental results show that our approach works well in the task.
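The abstract does not specify the probabilistic model, so the following is only a minimal sketch of the final prediction step under a simplifying assumption: the probability of a link is taken to be the edge density of the (user cluster, movie cluster) block that the pair falls into. The function names `block_link_probability` and `predict`, the toy edges, and the hard-coded cluster assignments (which stand in for the output of the K-means and spectral co-clustering steps) are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def block_link_probability(edges, user_cluster, movie_cluster):
    """Estimate a link probability for every (user cluster, movie cluster) block.

    Assumption (not the paper's model): the probability is the fraction of
    possible user-movie pairs in the block that are connected by an edge.
    """
    # Count how many users and movies fall into each cluster.
    users_per_cluster = defaultdict(int)
    for cluster in user_cluster.values():
        users_per_cluster[cluster] += 1
    movies_per_cluster = defaultdict(int)
    for cluster in movie_cluster.values():
        movies_per_cluster[cluster] += 1

    # Count observed user -> movie edges per block, each distinct edge once.
    edge_count = defaultdict(int)
    for user, movie in set(edges):
        edge_count[(user_cluster[user], movie_cluster[movie])] += 1

    # Density = observed edges / possible pairs in the block.
    return {
        (uc, mc): edge_count[(uc, mc)] / (n_users * n_movies)
        for uc, n_users in users_per_cluster.items()
        for mc, n_movies in movies_per_cluster.items()
    }


def predict(user, movie, density, user_cluster, movie_cluster):
    """Return the block density as the predicted probability of a link."""
    return density[(user_cluster[user], movie_cluster[movie])]


# Toy bigraph: directed edges from users to movies (hypothetical data).
edges = [("u1", "m1"), ("u1", "m2"), ("u2", "m1"),
         ("u3", "m3"), ("u4", "m3"), ("u4", "m4")]
# Hypothetical co-clustering result: cluster id for each node.
user_cluster = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}
movie_cluster = {"m1": 0, "m2": 0, "m3": 1, "m4": 1}

density = block_link_probability(edges, user_cluster, movie_cluster)
print(predict("u2", "m2", density, user_cluster, movie_cluster))
print(predict("u1", "m3", density, user_cluster, movie_cluster))
```

In this toy graph the unobserved pair (u2, m2) lies in a dense block and receives a high score, while (u1, m3) crosses two unrelated clusters and receives a score of zero. A block-level estimate like this needs only one pass over the edges, which is in the spirit of the low computational cost the abstract emphasizes, though the actual model may differ.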
Similar resources
Who Rated What: a combination of SVD, correlation and frequent sequence mining
KDD Cup 2007 focuses on predicting aspects of movie rating behavior. We present our prediction method for Task 1 “Who Rated What in 2006” where the task is to predict which users rated which movies in 2006. We use the combination of the following predictors, listed in the order of their efficiency in the prediction: • The predicted number of ratings for each movie based on time series predictio...
Implementation of Fuzzy c-Means and Outlier Detection for Intrusion Detection with KDD Cup 1999 Data Set
In this paper, a two-phase method for computer network intrusion detection is proposed. In the first phase, a set of patterns (data) are clustered by the fuzzy c-means algorithm. In the second phase, outliers are constructed by a distance-based technique and a class label is assigned to each pattern. The KDD Cup 1999 data set is used for the experiment. The results show that, for binary classif...
Intrusion Detection based on a Novel Hybrid Learning Approach
Information security and intrusion detection systems (IDS) play a critical role in the Internet. An IDS is an essential tool for detecting different kinds of attacks in a network and maintaining data integrity, confidentiality and system availability against possible threats. In this paper, a hybrid approach towards achieving high performance is proposed. In fact, the important goal of this paper ...
Taxonomy-Informed Latent Factor Models for Implicit Feedback
We describe an approach based on latent factor models to the Track 2 task of KDD Cup 2011, which required learning to discriminate between highly rated and unrated items from a large dataset of music ratings. We take the pairwise ranking route, training our models to rank the highly rated items above the unrated items which are sampled from the same distribution. Using the item relationship inf...
A Hybrid Framework for Building an Efficient Incremental Intrusion Detection System
In this paper, a boosting-based incremental hybrid intrusion detection system is introduced. This system combines incremental misuse detection and incremental anomaly detection. We use a boosting ensemble of weak classifiers to implement the misuse intrusion detection system. It can identify new classes of intrusions that do not exist in the training dataset for incremental misuse detection. As...
Publication date: 2007